Going All-In on LLM Accuracy: Fake Prediction Markets, Real Confidence Signals
Large language models are increasingly used to evaluate other models, yet these judgments typically lack any representation of confidence. This pilot study tests whether framing an evaluation task as a betting game (a fictional prediction market with its own LLM currency) improves forecasting accuracy and surfaces calibrated confidence signals. We generated 100 math and logic questions with verifiable answers. Six baseline models (three current-generation, three prior-generation) answered all items. Three predictor models then forecast, for each question-baseline pair, whether the baseline would answer correctly. Each predictor completed matched runs in two conditions: Control (simple correct/incorrect predictions) and Incentive (predictions plus wagers of 1-100,000 LLMCoin at even odds, starting from a 1,000,000 LLMCoin bankroll). Across 5,400 predictions per condition, Incentive runs showed modestly higher accuracy (81.5% vs. 79.1%, p = .089, d = 0.86) and significantly faster learning across rounds (a 12.0 vs. 2.9 percentage-point improvement from Round 1 to Round 4, p = .011). Most notably, stake size tracked confidence: "whale" bets of 40,000+ coins were correct ~99% of the time, while small bets (<1,000 coins) reached only ~74% accuracy. The key finding is not that fictional money makes models smarter; accuracy gains were modest and did not reach statistical significance (p = .089) in this pilot. Rather, the betting mechanic created a legible confidence signal absent from binary yes/no outputs. This suggests that simple financial framing may help transform LLMs into risk-aware forecasters, making their internal beliefs visible and usable. The protocol offers a foundation for future work on meta-evaluation systems and what may become LLM-to-LLM prediction markets.
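As a concrete illustration of the wagering mechanic described in the abstract, the sketch below settles bets at even odds against a bankroll and computes accuracy within stake buckets. It is a minimal reconstruction from the abstract alone; the `Bet` schema and function names are my assumptions, not the authors' code.

```python
# Illustrative sketch (not the study's code): even-odds wager settlement and
# accuracy bucketed by stake size, with hypothetical field names.
from dataclasses import dataclass

@dataclass
class Bet:
    stake: int        # wager size, 1 to 100,000 LLMCoin
    predicted: bool   # forecast: will the baseline answer correctly?
    actual: bool      # outcome checked against the verifiable answer key

def settle(bankroll: int, bet: Bet) -> int:
    """Even odds: a correct prediction wins the stake; an incorrect one loses it."""
    return bankroll + bet.stake if bet.predicted == bet.actual else bankroll - bet.stake

def accuracy_by_stake(bets: list[Bet], lo: int, hi: int) -> float:
    """Fraction of correct predictions among bets with lo <= stake < hi."""
    bucket = [b for b in bets if lo <= b.stake < hi]
    return sum(b.predicted == b.actual for b in bucket) / len(bucket)

bets = [Bet(50_000, True, True), Bet(500, True, False), Bet(800, False, False)]
bankroll = 1_000_000
for b in bets:
    bankroll = settle(bankroll, b)
print(bankroll)                                   # 1050300
print(accuracy_by_stake(bets, 40_000, 100_001))   # "whale" bucket: 1.0
```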
Testing Transformer Learnability on the Arithmetic Sequence of Rooted Trees
Breccia, Alessandro, Gerace, Federica, Lippi, Marco, Sicuro, Gabriele, Contucci, Pierluigi
Prime factorization, the decomposition of a natural number into its constituent primes, lies at the crossroads of arithmetic, complexity theory, and computational practice. While every integer admits a unique factorization, the operational effort required to obtain it grows quickly with its magnitude. State-of-the-art algorithms achieve remarkable performance for moderately large inputs, yet their complexity escalates rapidly when confronted with truly large instances. Moreover, in this limit, the sequence of integers with known prime factorizations becomes effectively sparse, with regions where the factorizations of intermediate values are computationally inaccessible. It is therefore natural to ask whether modern machine learning methods, and more specifically Large Language Models (LLMs), can offer any advantages from this perspective.
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.68)
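For context on the abstract's claim that factoring effort escalates with input magnitude, here is a minimal trial-division sketch (mine, not the paper's): its worst-case work grows with the square root of n, which already makes large semiprimes impractical, and state-of-the-art methods only push the wall further out.

```python
# Naive trial-division factorization; the baseline against which the
# abstract's "effort grows quickly with magnitude" point is easiest to see.
def factorize(n: int) -> list[int]:
    """Return the prime factors of n in nondecreasing order."""
    factors = []
    d = 2
    while d * d <= n:          # worst-case work scales with sqrt(n)
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:                  # whatever remains is itself prime
        factors.append(n)
    return factors

print(factorize(84))           # [2, 2, 3, 7]
print(factorize(2**31 - 1))    # [2147483647], a Mersenne prime
```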
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
Badertdinov, Ibragim, Golubev, Alexander, Nekrashevich, Maksim, Shevtsov, Anton, Karasik, Simon, Andriushchenko, Andrei, Trofimova, Maria, Litvintseva, Daria, Yangel, Boris
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks collected with the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark to their results on SWE-bench Verified and show that the performance of some models may be inflated by contamination.
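The decontamination idea in the abstract can be pictured as a date filter: only tasks created after a model's training cutoff enter its evaluation split. The sketch below is a schematic reading of that idea with invented field names, not the SWE-rebench pipeline itself.

```python
# Hypothetical sketch of freshness-based decontamination: keep only tasks
# whose source PR/issue postdates the model's training cutoff.
from datetime import date

def decontaminated_split(tasks: list[dict], training_cutoff: date) -> list[dict]:
    """Select tasks created after the cutoff, so the model cannot have seen them."""
    return [t for t in tasks if t["created_at"] > training_cutoff]

tasks = [
    {"repo": "org/project", "task_id": 1, "created_at": date(2024, 11, 2)},
    {"repo": "org/project", "task_id": 2, "created_at": date(2025, 6, 15)},
]
fresh = decontaminated_split(tasks, training_cutoff=date(2025, 1, 1))
print([t["task_id"] for t in fresh])  # [2]
```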
Controlling Thinking Speed in Reasoning Models
Lin, Zhengkai, Fu, Zhihang, Chen, Ze, Chen, Chao, Xie, Liang, Wang, Wenxiao, Cai, Deng, Wang, Zheng, Ye, Jieping
Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking-speed adjustment, optimizing the accuracy-efficiency trade-off. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs' representation space. Using this vector, we achieve the first representation-editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to flag reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis of complex reasoning. Without any training or additional cost, our plug-in module delivers an average +1.3% accuracy with 8.6% lower token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented on top of vLLM, and we expect them to support broader applications and inspire future research.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.87)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
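The control mechanism the abstract names, a steering vector applied in representation space, can be sketched as a forward hook that shifts a layer's activations along a fixed direction. Below is a toy PyTorch illustration under that reading; the actual vector extraction and scheduling are the paper's contribution and are not reproduced here.

```python
# Schematic representation editing: add a scaled "slow-thinking" direction v
# to a layer's output at inference time. Toy model; v and alpha are placeholders.
import torch
import torch.nn as nn

hidden = 16
layer = nn.Linear(hidden, hidden)     # stand-in for a transformer block
v = torch.randn(hidden)               # steering direction (placeholder)
v = v / v.norm()
alpha = 2.0                           # sign/scale would set slower vs. faster thinking

def steer(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return output + alpha * v

handle = layer.register_forward_hook(steer)
x = torch.randn(1, hidden)
print(layer(x).shape)                 # steered forward pass, shape (1, 16)
handle.remove()                       # restore default behavior
```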
Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models
Han, Simeng, Dai, Howard, Xia, Stephen, Zhang, Grant, Liu, Chen, Chen, Lichang, Nguyen, Hoang Huy, Mei, Hongyuan, Mao, Jiayuan, McCoy, R. Thomas
Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the reasoning strategies that models use. Brainteasers are well suited to this goal because they can be solved with multiple approaches, such as a few-step solution that relies on a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We examine many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise, mathematical-competition-style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.
- Asia (0.67)
- North America > United States (0.28)
- Europe > Austria (0.27)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
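The five probes enumerated in the abstract form a pipeline, which a harness might wire together as below. This is a hypothetical sketch: `query_llm` is a stand-in for any chat-completion call, and the prompts are schematic, not the benchmark's actual instructions.

```python
# Hypothetical harness mirroring the five probes listed in the abstract.
def query_llm(prompt: str) -> str:
    # Stand-in for a real chat-completion call; echoes a canned reply so the
    # sketch runs without an API key.
    return f"<model reply to: {prompt.splitlines()[0][:48]}>"

def evaluate_brainteaser(puzzle: str, gold_solution: str, hint: str) -> dict:
    formal = query_llm(f"Rewrite as a competition-style problem:\n{puzzle}")
    solution = query_llm(f"Solve:\n{formal}")
    revised = query_llm(f"Gold solution:\n{gold_solution}\nCorrect your answer:\n{solution}")
    sketch = query_llm(f"Give a step-by-step solution sketch for:\n{formal}")
    hinted = query_llm(f"Solve using this hint: {hint}\n{formal}")
    return {"parse": formal, "solve": solution, "self_correct": revised,
            "sketch": sketch, "with_hint": hinted}

print(evaluate_brainteaser("A narrative brainteaser...", "gold...", "try parity"))
```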
Chain of Execution Supervision Promotes General Reasoning in Large Language Models
Chen, Nuo, Li, Zehua, Bao, Keqin, Lin, Junyang, Liu, Dayiheng
Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms, and algorithmic competition problems, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continued pretraining, instruction tuning after pretraining, and two-stage fine-tuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.
- North America > United States > Washington > King County > Seattle (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
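The core transformation the abstract describes, turning code execution into step-by-step rationales, can be approximated with Python's built-in tracing facility: run a function and record its local variables at every executed line. This is my minimal sketch of the idea, not the TracePile generation pipeline.

```python
# Record a line-by-line execution trace (line number + local variables),
# the raw material a Chain-of-Execution-style rationale could be built from.
import sys

def trace_execution(func, *args):
    """Run `func` and collect (line number, locals) at each executed line."""
    steps = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)   # always unhook, even if func raises
    return result, steps

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

result, steps = trace_execution(gcd, 48, 18)
for lineno, local_vars in steps:
    print(f"line {lineno}: {local_vars}")
print("result:", result)     # 6
```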
Concise Reasoning in the Lens of Lagrangian Optimization
Gao, Chengqian, Li, Haonan, Killian, Taylor W., She, Jianshu, Wang, Renxi, Ma, Liqun, Cheng, Zhoujun, Hao, Shibo, Xu, Zhiqiang
Concise reasoning in large language models seeks to generate only the essential intermediate steps needed to arrive at a final answer, thereby alleviating issues of "over-thinking". Most proposed approaches hinge on carefully hand-crafted heuristics, struggle to balance concision with performance, and often fail to adapt across domains and model scales. In this work, we address these challenges by introducing a principled and pragmatic strategy, performance-aware length updating (PALU). As a principled algorithm, PALU formulates concise reasoning as a constrained optimization problem, minimizing response length subject to a performance constraint, and then applies Lagrangian optimization to convert it into a tractable unconstrained problem. As a pragmatic solution, PALU streamlines complicated update rules through three approximations: (i) estimating performance with off-policy rollouts, (ii) truncating the Lagrange multiplier to two extremes, and (iii) replacing gradient-based updates with quantile-driven length adjustments. Furthermore, PALU is demonstrated to adapt across both domains (logic, STEM, and math) and model scales (1.5B, 7B, 14B), establishing the algorithm as a practical and effective concise reasoning approach.

Reasoning, which requires large language models (LLMs) to work through intermediate steps before producing a final answer, substantially improves performance on complex tasks such as mathematics (Jaech et al., 2024; Shao et al., 2024), programming (Lambert et al., 2024), and value alignment (Guo et al., 2025). Yet this benefit is often accompanied by overthinking: redundant self-reflection, backtracking, and validation (Chen et al., 2024; Zhang et al., 2024; Fatemi et al., 2025). These limitations inflate inference costs and hamper user experience, motivating the need for concise reasoning: the production of only the essential steps required to reach a correct answer.
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Middle East > Iraq > Basra Governorate > Basra (0.04)
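Read as a constrained program, the abstract's formulation is plausibly the following (my notation, reconstructed from the description: J denotes task performance and |y_theta(x)| response length):

```latex
% Reconstructed constrained objective for PALU (notation assumed, not the paper's):
\min_{\theta}\; \mathbb{E}_{x}\bigl[\lvert y_{\theta}(x)\rvert\bigr]
\quad \text{s.t.} \quad J(\theta) \ge J_{\mathrm{target}}
% Lagrangian relaxation, solved as an unconstrained problem:
\mathcal{L}(\theta,\lambda)
  = \mathbb{E}_{x}\bigl[\lvert y_{\theta}(x)\rvert\bigr]
  + \lambda\,\bigl(J_{\mathrm{target}} - J(\theta)\bigr),
\qquad \lambda \ge 0
```

Under this reading, truncating the multiplier to its two extremes yields a bang-bang rule: when estimated performance meets the target, the length term dominates and the length budget tightens (via a quantile of observed response lengths); when it falls short, the budget relaxes.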
Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Golubev, Alexander, Trofimova, Maria, Polezhaev, Sergei, Badertdinov, Ibragim, Nekrashevich, Maksim, Shevtsov, Anton, Karasik, Simon, Abramov, Sergey, Andriushchenko, Andrei, Fisin, Filipp, Skvortsov, Sergei, Yangel, Boris
Research on applications of reinforcement learning (RL) to large language models has mostly focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn Markov decision processes (MDPs), this view corresponds to a degenerate case of multi-turn interaction in which the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Our methodology begins with rejection fine-tuning (RFT) using execution feedback to train a policy to follow instructions and formatting effectively, followed by a synchronous RL pipeline using DAPO for iterative improvement. Applying this pipeline to Qwen2.5-72B-Instruct, we increase its Pass@1 on the SWE-bench Verified benchmark from 11% to 39%, substantially improving on the 20% RFT baseline. On the May and June splits of SWE-rebench, the resulting agent achieves Pass@1 of 35% and 31% respectively, competitive with even larger models such as DeepSeek-V3-0324 or Qwen3-235B-A22B, demonstrating that our methodology offers a practical approach to training capable agents for multi-turn interactive tasks with open-weight models.
- Workflow (0.46)
- Research Report (0.42)
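The first stage the abstract describes, rejection fine-tuning with execution feedback, amounts to: sample trajectories, keep only those that pass the task's tests, and fine-tune on the survivors. The sketch below illustrates that loop under those assumptions; `run_agent`, `passes_tests`, and `finetune` are placeholders, not the authors' implementation.

```python
# Schematic rejection fine-tuning (RFT) loop with execution feedback.
import random

def run_agent(task: str) -> list[str]:
    # Placeholder rollout; a real agent would interact with a SWE environment.
    return [f"action for {task} #{i}" for i in range(random.randint(1, 3))]

def passes_tests(trajectory: list[str]) -> bool:
    # Placeholder execution feedback; a real check runs the repo's test suite.
    return random.random() < 0.3

def finetune(dataset: list[list[str]]) -> None:
    print(f"fine-tuning on {len(dataset)} successful trajectories")

tasks = [f"task-{i}" for i in range(100)]
accepted = []
for task in tasks:
    for _ in range(4):                  # several rollouts per task
        traj = run_agent(task)
        if passes_tests(traj):          # rejection step: keep successes only
            accepted.append(traj)
finetune(accepted)                      # RFT stage; RL (e.g., DAPO) would follow
```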